Toward a Persistent Object Base


John  R.  Nestor 


ABSTRACT To better understand the needs of future programming environments, two current
technologies that support persistent data in programming environments are considered: file
systems and database systems. This paper presents a set of weaknesses present in these current
technologies. These weaknesses can be viewed as a checklist of issues to be considered when
evaluating or designing programming environments.


1 Introduction


Every programming environment must support not only transient data that is used during
computation but also persistent data that is kept over some period of time. Two widely used
current technologies support persistent data: file systems and database systems. There is
increasing recognition that neither of these technologies alone will provide an adequate basis for
the next generation of programming environments. Most new environment efforts are moving
toward a more object-oriented approach that is a synthesis of ideas from file systems and
databases. Some examples are CAIS [DoD 85], the ESPRIT Portable Common Tool Environment
[ESPRIT 85], the Common Lisp Framework [CLF 85], and Arcadia [Taylor 86]. This next
generation of technology will be referred to as persistent object bases.


To better understand the nature of the technology needed by future programming environments,
this paper considers the weaknesses that will have to be eliminated in traditional file systems and
database systems to create a first class persistent object base. Section 2 sets the context for
later sections by discussing the character and needs of future programming environments.
Sections 3 and 4 cover, respectively, the weaknesses of traditional file systems and database
technologies. Section 5 presents conclusions.

2 Context


Modern software technologies allow software engineers to automate many of the processes that
are often implemented by inefficient manual or semi-automatic procedures. Such improvements
increase our expectations, leading to larger software projects. Larger projects, in turn, require
improved communications among managers, users, designers, and maintainers of such projects.


As the software to be produced grows in size and complexity and the communication requirements
increase in scope, the tools required to develop and support such software must become
more powerful, and the computational system to support the software tools must grow
proportionately in scope. In place of a single batch or time sharing machine, increasing emphasis is
being placed on use of workstations, distributed computation, and networks to integrate
previously separate computer systems into a single vast communication and computational
system. Not only must the hardware evolve, but the environment itself must be developed,
upgraded, and enhanced over a lifetime of many years.


Programming  environments  have  become  a  focal  point  for  much  of  the  work  directed  toward 
improving  the  practice  of  software  engineering.  Such  environments  provide  support  for  software 
development, management, and maintenance. There are some primitive programming environments
already available; there are many next generation environments currently being designed;
and work on environments will be a major technical thrust of software engineering for many years
to come. There are two top level design goals that will make future environments successful:
openness  and  integration. 

Openness refers to the ability to incorporate tools, methodologies, and technologies into the
environment  as  needs  and  opportunities  arise.  For  an  environment  to  be  open,  it  must  provide  a 
set  of  interfaces  that  permit  new  features  and  tools  to  be  easily  inserted.  The  degree  to  which  an 
environment  can  be  extended  to  support  a  wide  variety  of  new  tools  and  methodologies  is  one 
measure  of  the  degree  to  which  that  environment  can  be  considered  to  be  open.  Openness  can 
also  be  enhanced  by  the  way  in  which  the  interfaces  are  made  available;  public  availability  (as 
opposed  to  proprietary  control),  quality  documentation,  ease  of  use,  acceptable  performance, 
stability,  portability,  and  standardization  can  all  contribute  to  the  openness,  in  actual  practice,  of 
an  environment. 

Integration  means  that  the  components  of  the  environment  work  together  through  a  uniform 
interface,  style  of  operation,  and  communication  medium.  The  cooperation  of  the  components 
allows  for  better  use  of  information  sharing,  resulting  in  an  intelligent  environment. 

Though  openness  and  integration  are  important  to  software  development  environments,  most 
systems to date have emphasized openness over integration, or vice versa: there are few existing
examples of systems that achieve both. Nevertheless, this tension can be resolved in a way
that  will  enable  both  goals  to  be  achieved;  the  key  lies  largely  in  the  design  of  the  infrastructure  of 
the  environment,  the  kernel  parts  on  which  all  other  tools,  features,  and  methodology  support  are 
built.  If  the  infrastructure  is  not  properly  designed,  increasing  complexity  of  our  environments  and 
the  systems  they  are  used  to  construct  will  make  quality  increasingly  difficult  to  achieve. 

In earlier systems, the infrastructure was provided mainly by the operating system, in which the
primary concern was resource allocation and scheduling. As a result of improved hardware
technology, new software engineering tools, evolving views of the software development process,
and ever increasing expectations, a shift of emphasis has occurred in our view of the role of the
infrastructure.

Persistent  object  bases  are  a  key  part  of  the  infrastructure  of  future  programming  environments. 
Providing  a  high-quality  persistent  object  base  is  a  necessary,  although  not  sufficient,  condition 
for  achieving  the  full  potential  of  future  programming  environments. 

3 Weaknesses of Traditional File Systems

This  section  considers  five  areas  where  traditional  file  systems  are  inadequate  for  persistent 
object bases: organization, abstraction, history, attributes, and synchronization. Unix¹ [Ritchie
74] is used here as an example of a traditional file system. Other traditional file systems differ in
their  details  from  the  Unix  file  system  but  display  essentially  similar  weaknesses. 

3.1  Organization 

The Unix file system is organized as a tree of files, each of which is either a directory or an
ordinary file.² Within the tree, directories appear as inner nodes and files appear as leaf nodes.
The root of the tree is a unique directory from which all directories and files can be reached.
Each directory is a mapping between file names and the files themselves. Each file has a unique
path name given by the path from the root directory to the file. For example, the path name
/usr/bin/man is for a file named man that is reached from the root directory via first the usr
directory, then via the bin directory.

One  problem  with  a  tree  structured  file  system  is  that  the  user  is  forced  to  represent  a  system  in  a 
way that does not reflect the structure of the data in the system. A related problem is that as a
system evolves the user periodically must do major reorganizations of the data within the file
system. These reorganizations are needed because the preexisting hierarchical structure
increasingly deviates from the actual logical relationships.

Consider,  for  example,  a  system  being  built  as  part  of  some  project  called  Q_Development.  A 
directory  is  built  for  the  project. 

/projects/Q_Development

Initially,  all  files  for  the  project  are  placed  in  that  directory.  Soon  the  number  of  files  in  that 
directory  has  grown  to  where  more  structure  is  needed.  Suppose  that  both  documentation  files 
and  program  files  exist.  To  provide  more  structure,  two  new  directories  are  created. 

/projects/Q_Development/documentation
/projects/Q_Development/program

All of the files are moved into one or the other of these two directories. Not only is there the extra
work involved in moving the files into the two new subdirectories, but any shell scripts that
referred to Q_Development must now be changed to refer to one or the other or both of the two
new subdirectories. For a persistent object base, no moves should be required and existing shell
scripts  should  remain  unchanged.  Additional  information  would  be  added  on  top  of  the  existing 
structure. 


¹Unix is a trademark of AT&T.

²There are also special files and links that for simplicity are not discussed here.


Consider next that it is time to release the Q system to users. Users should have the Q
executable file and the Q user manual, but not the Q source code or the Q internal documentation.
These files are a subset of the files in the two subdirectories. Since users should not have to
know about the substructure of the Q project directories and be confused by all those other files
that don't matter to them, a new directory is created to hold copies of the files that the users will
need.³

/release/Q

Moving files was bad enough, but in this case there are actually two copies of the same files.⁴ For
a  persistent  object  base,  information  would  be  added,  but  files  would  not  be  moved  or  copied. 

Finally, consider that it is time to produce a new version of the Q system while leaving the
previous  version  of  Q  around.  To  do  this,  the  directories  must  be  split. 

/projects/Q_Development/documentation/V1
/projects/Q_Development/documentation/V2
/projects/Q_Development/program/V1
/projects/Q_Development/program/V2
/release/Q/V1
/release/Q/V2

Here all the old files are moved into the V1 directories. The V2 directories will be used for the
new version of the system. A simple way to do this is to start by copying all the V1 files into V2.
Work on the new version then can be done by changing the V2 files while leaving the V1 files
intact.⁵ Furthermore, when versions were introduced, why wasn't the directory tree split in one of
the following ways?

/projects/Q_Development/V1/documentation
/projects/Q_Development/V1/program
/projects/Q_Development/V2/documentation
/projects/Q_Development/V2/program
/release/Q/V1
/release/Q/V2

/V1/projects/Q_Development/documentation
/V1/projects/Q_Development/program
/V1/release/Q
/V2/projects/Q_Development/documentation
/V2/projects/Q_Development/program
/V2/release/Q


³At least in some file systems symbolic links could be used to avoid the copy. In Unix, however, hard links can only be
made between a directory and a file on the same physical volume. Symbolic links can cross volumes but result in an
asymmetrical specification of a symmetrical situation.

⁴A well known software engineering "rule" states that when there are two identical copies of the same file at least one of
them is different!

⁵Some file systems provide a search list mechanism where the V2 directories are initially empty and a search list is set
that searches first V2 then V1. Any time a file not in V2 must be changed it is first copied into V2. This is again sort of a
solution but is tedious, confusing, and error prone.


The answer is that there is no strong reason to prefer one of these structures over the others.
In a persistent object base, all three of these forms should be indistinguishable.
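One way to picture the three forms being indistinguishable is to treat each path component as an attribute constraint rather than as a position in a tree. The following is a minimal sketch under that assumption; the `ObjectBase` class, the `key=value` path syntax, and the attribute names are hypothetical illustrations, not a design given in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    oid: int          # identity that never changes when views are added
    name: str
    attrs: frozenset  # (key, value) pairs describing the object


class ObjectBase:
    def __init__(self):
        self._objects = []

    def add(self, name, **attrs):
        obj = StoredObject(len(self._objects) + 1, name, frozenset(attrs.items()))
        self._objects.append(obj)
        return obj

    def select(self, **attrs):
        """Return every object carrying all of the requested attributes."""
        wanted = set(attrs.items())
        return [o for o in self._objects if wanted <= o.attrs]

    def resolve(self, path):
        """Treat each 'key=value' component as a constraint; order is irrelevant."""
        parts = [p for p in path.split("/") if p]
        return self.select(**dict(p.split("=", 1) for p in parts))


base = ObjectBase()
manual_v1 = base.add("manual", project="Q_Development", kind="documentation", version="V1")
base.add("manual", project="Q_Development", kind="documentation", version="V2")
base.add("q.c", project="Q_Development", kind="program", version="V1")

# Two different orderings of the same constraints name the same object.
by_project_first = base.resolve("/project=Q_Development/kind=documentation/version=V1")
by_version_first = base.resolve("/version=V1/project=Q_Development/kind=documentation")
```

In this sketch, introducing versions or a release view only adds attributes to existing objects; nothing is moved or copied, and queries written before the new attribute existed keep working.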


3.2 Abstraction

Unix starts with the assumption that all files exist on the same physical volume (typically a disk).
In order to deal with multiple physical volumes, Unix has a mount command. The mount command
has two arguments: an existing directory and a new volume that itself holds a file system
consisting of a tree of directories and ordinary files. A mount causes the file tree on the new
volume to be "pasted" into the file tree in place of the specified existing directory. The net effect
is that there is a strong coupling between the path name of a file and its physical location. As
Unix has come to be used in distributed networks, several network file systems have been
proposed, including Apollo DOMAIN [Leach 83], Sun NFS [Sandberg 85], and AT&T RFS [Hatch
85]. In all of these systems, the path name is coupled to the physical placement within the
network.

Modern data abstraction [Shaw 84] shows that considerable benefits can be achieved by separating
the logical structure of data from its representation. As can be seen above, the Unix file
system blurs together the logical concept of path name with the representational concept of
physical location. Representational properties frequently influence the logical structure of data.
Since physical volumes have a finite maximum data size, the number of files within the subtree
for a volume is constrained. When the data size exceeds the physical space, the user is forced
into modifying the logical structure. In networks, data on a local disk is often faster to access than
data on a remote node. By changing the physical placement of data within the network, and
therefore its logical structure, a user can get faster file access. In both these cases, the user who
wants to deal with the logical structure of the data frequently spends considerable time also
dealing with the physical constraints of the file system.

The Unix file system does not support flexible physical representations. For example, there is no
way in Unix to transparently store a file in a compressed format using text compression [Welch
84] or as a delta relative to some related file [Rochkind 75, Katz 84]. This kind of transparency
would  eliminate  the  user  burden  of  explicitly  invoking  a  decompressing  program  before  each  use 
of  the  compressed  file. 

Another kind of flexible representation is the use of multiple cached copies of the same file
[Schroeder  85].  Within  Unix,  caching  can  be  provided  only  by  modifying  the  Unix  kernel. 

In a persistent object base, data abstraction should be practiced so that logical concepts are
decoupled from physical representations; richer representations should be possible by providing
the ability to program the implementation of file abstractions. The Apollo extensible streams
mechanism [Apollo 88] is an example of such a data abstraction mechanism grafted on top of a
Unix file system.
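The idea of programming the implementation of a file abstraction can be sketched as a file object whose readers see only logical content while the stored representation is swappable. This is an illustrative sketch only: the class names are hypothetical, and `zlib` stands in for whatever compression scheme an object base might use (it is not the method of [Welch 84]).

```python
import zlib


class PlainRepresentation:
    """Stores the logical bytes unchanged."""

    def encode(self, data: bytes) -> bytes:
        return data

    def decode(self, stored: bytes) -> bytes:
        return stored


class CompressedRepresentation:
    """Stores the logical bytes compressed (zlib is a stand-in choice)."""

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode(self, stored: bytes) -> bytes:
        return zlib.decompress(stored)


class StoredFile:
    """Logical file whose physical representation is hidden from readers."""

    def __init__(self, data: bytes, representation):
        self._representation = representation
        self._stored = representation.encode(data)

    def read(self) -> bytes:
        # Callers never invoke a decompressor themselves.
        return self._representation.decode(self._stored)

    def change_representation(self, representation) -> None:
        data = self.read()
        self._representation = representation
        self._stored = representation.encode(data)

    def stored_size(self) -> int:
        return len(self._stored)


text = b"persistent object base\n" * 200
manual = StoredFile(text, PlainRepresentation())
size_plain = manual.stored_size()
manual.change_representation(CompressedRepresentation())
size_compressed = manual.stored_size()
```

A tool that calls `read()` behaves identically before and after the representation change, which is the transparency that the Unix file system lacks.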

3.3 History

Two  related  history  concepts  are  considered  here:  source  versions  and  re-creation. 

Every  time  a  source  file  is  edited,  logically  a  new  version  is  created,  so  that  over  time  a  linear 
sequence  of  versions  is  created.  When  alternatives  occur,  such  as  when  a  bug  is  fixed  in  an  old 
release  while  work  continues  on  the  next  release,  the  sequence  can  fork,  and  when  alternatives 
come together separate sequences can join. Abstractly, a directed acyclic version graph is
formed.  Not  all  points  in  the  version  graph  are  equally  important;  in  practice,  users  impose 
additional  structure  at  one  or  more  levels  of  granularity  and  do  not  preserve  versions  below  some 
minimum  level  of  granularity.  The  finest  granularity  corresponds  to  every  edit.  A  coarse 
granularity  would  be  at  major  release  points.  Intermediate  granularities  are  frequently  defined  to 
aid  the  management  of  a  development  project.  The  concept  of  versions  can  be  usefully  extended 
to  multiple  related  source  files  which  may  be  considered  to  be  progressing  in  parallel  along  a 
version  graph. 
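The fork and join behaviour described above can be sketched as a small graph in which each version records its immediate predecessors. The `VersionGraph` class and the version labels are hypothetical; acyclicity follows in this sketch from requiring that predecessors already exist when a version is added.

```python
class VersionGraph:
    """Directed acyclic graph of versions, keyed by version id."""

    def __init__(self):
        self._preds = {}  # version id -> tuple of immediate predecessor ids

    def add(self, vid, *preds):
        if vid in self._preds:
            raise ValueError(f"version {vid!r} already exists")
        for p in preds:
            if p not in self._preds:
                raise KeyError(f"unknown predecessor {p!r}")
        # Predecessors must pre-exist, so no cycle can ever be formed.
        self._preds[vid] = tuple(preds)

    def predecessors(self, vid):
        return self._preds[vid]

    def ancestors(self, vid):
        """All versions from which vid was (transitively) derived."""
        seen, stack = set(), list(self._preds[vid])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self._preds[v])
        return seen


graph = VersionGraph()
graph.add("1")                 # initial version
graph.add("2", "1")            # linear edit
graph.add("2-fix", "2")        # fork: bug fix on the old release
graph.add("3", "2")            # work continues on the next release
graph.add("4", "3", "2-fix")   # join: the fix is folded back in
```

Here version "4" has two immediate predecessors, which is the case that simple linear naming conventions cannot express directly.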

One  common  way  of  handling  source  versions  is  through  the  use  of  naming  conventions:  either  at 
the  directory  or  the  file  level.  Earlier  in  this  paper,  directory  naming  conventions  were  used  as  a 
way  of  representing  versions  of  related  sets  of  files.  For  example,  the  two  directories  below 
would hold all the Q system source files associated with each of the two versions.

/projects/Q_Development/V1
/projects/Q_Development/V2

A  method  for  dealing  with  individual  source  files  is  use  of  a  generation  mechanism.  Although 
Unix  provides  no  special  generation  mechanism,  the  same  effect  can  be  realized  by  file  naming 
conventions. For example, two versions of the same file could be named using a version
extension.

/projects/Q_Development/Q_Control.ada.V1
/projects/Q_Development/Q_Control.ada.V2

One disadvantage of this approach is that all shell scripts need to be aware of the generation
naming conventions, and any V1 shell script needs to be edited before it can be used for V2.⁶
When using conventions for representing version relationships, the entire burden for ensuring
consistency rests with the user. Although a convention for representing linear version
relationships is obvious, conventions representing forks and joins in the version graph are less clear.

A more sophisticated source version system is provided by the Unix SCCS tool and by a similar
but improved tool RCS [Tichy 82]. SCCS keeps track of all the versions of a single source file. It
provides support for both forks and joins.⁷ The SCCS implementation holds all versions of a
source file in a single file called the s-file. Before any use of a particular version of that source
can occur, it must be extracted explicitly from the s-file. Typically, shell scripts will contain calls to
SCCS for this purpose. The big disadvantage of SCCS is that it is an ad hoc data encoding
scheme implemented on top of the file system, rather than as part of it. In addition to its logical
properties, SCCS also uses the representational method of source deltas to encode the versions.
This is yet another example of how logical properties and physical representation have been
blurred together.

⁶The edit could be avoided by passing the version as a string parameter which is then concatenated to all file names.

⁷The SCCS documentation suggests that forks be kept to a minimum to avoid structural complexity.

In  a  persistent  object  base,  SCCS  functionality  would  be  provided  in  a  transparently  integrated 
manner. Versions and data compression would be handled by orthogonal mechanisms.⁸ The
DSEE  system  [Leblang  85]  is  one  current  example  of  how  this  could  be  done. 

Re-creation is the ability to go back to an old version of a system and repeat all of the
steps that were involved in its creation. Re-creation implies that all information about system
creation is captured. Traditionally, a lot of the system creation information was held only in the
heads of the development team, making re-creation difficult. Re-creation is important for two
major reasons. First, if a system is re-creatable, important structural relationships between the
files of the system are captured. System maintainers can use the relationships directly and use
support tools that depend upon having the relationships available. Second, if an old version of a
system has a bug, re-creation means that a minor variation of it can be constructed in which the
bug is fixed. To better understand re-creation, the concept of a derivation graph is used.
Derivation graphs were used in Toolpack [Osterweil 83]. The definition used here is a somewhat
simplified form of the model presented in [Borison 86].

Those  files  that  make  up  a  system  can  be  divided  into  primitive  and  derived  files.  A  primitive  file 
is  either  a  source  file  of  the  system  or  some  file  from  outside  the  system  that  is  used  in  its 
construction.  A  derivation  step  consists  of  an  invocation  that  accesses  a  set  of  input  files  to 
produce a set of output files. The invocation includes a tool consisting of an "executable" file and
a set of actual parameters to that tool, which could be either constants or files (or their names). It
is assumed that the output files depend only upon the input files and the invocation involved in
the derivation step.⁹ Derived files are those that are output of some derivation step. The inputs of
a  derivation  step  and  the  file  holding  the  tool  being  run  in  the  derivation  step  must  be  either 
primitive  files  or  output  files  of  some  earlier  derivation  step.  The  combination  of  all  the  derivation 
steps  for  a  system  is  its  directed  acyclic  derivation  graph.  A  system  is  re-creatable  if  all  of  its 
derived  files  can  be  re-created  identically.  In  terms  of  a  derivation  graph,  re-creation  is  possible  if 
each  derivation  step  is  known,  the  set  of  primitive  files  is  known,  and  the  primitive  files  have  not 
been  changed  since  the  system  was  first  created. 
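The definitions above can be made concrete with a short sketch. Primitive files are computed as every input or tool file that no step produces, and re-creatability is checked by comparing a recorded digest of each primitive file with its current digest. The `Step` type, the file names, and the use of digests as the "has not been changed" test are assumptions made for illustration; the sketch also does not verify step ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str            # file holding the executable that is run
    params: tuple        # constant parameters of the invocation
    inputs: frozenset    # files read by the step
    outputs: frozenset   # files written by the step


def primitive_files(steps):
    """Files used by some step (as input or tool) that no step produces."""
    derived = set()
    used = set()
    for step in steps:
        derived |= step.outputs
        used |= step.inputs | {step.tool}
    return used - derived


def is_recreatable(steps, digests_at_creation, digests_now):
    """True if every primitive file is known and unchanged since creation."""
    return all(
        name in digests_at_creation
        and digests_at_creation[name] == digests_now.get(name)
        for name in primitive_files(steps)
    )


# Hypothetical two-step build: compile q.c, then link q.o into q.
steps = [
    Step("cc", ("-c",), frozenset({"q.c", "q.h"}), frozenset({"q.o"})),
    Step("ld", (), frozenset({"q.o"}), frozenset({"q"})),
]
recorded = {"cc": "d1", "ld": "d2", "q.c": "d3", "q.h": "d4"}
after_edit = dict(recorded, **{"q.h": "d5"})  # q.h was modified later
```

Note that the tools themselves (`cc` and `ld` here) count as primitive files: a changed compiler breaks re-creation just as a changed source file does.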

In Unix, creation is often accomplished using the Unix tool Make [Feldman 79]. Make applies a
set of heuristics to a makefile that contains a list of explicit commands to determine and run a set
of invocations.¹⁰ Make is concerned with invocations, not with the more general concept of
derivation steps. In general, it is not possible to tell the complete set of input files and output files
of each derivation step of a system by looking only at the makefile; therefore, it is not possible to
determine the set of primitive files of a system. By convention, users normally include information
about these file sets in their makefiles, but there is no check to be sure that this information is
complete or even correct. The reason for this deficiency goes deeper than just the Make tool. In
Unix, when a tool is invoked, there is no way to tell what files are opened for input and/or output.
Such an inquiry is essential to guarantee re-creation, but when arbitrary tools can be invoked
during derivation,¹¹ this inquiry can only be implemented by making modifications to the Unix
kernel.

⁸Version information, however, could be used to guide heuristics that identify candidates for delta compression.

⁹In practice, it is necessary to deal with things like steps that read the system clock or that interact with the user. For
current purposes such problems are ignored.


For re-creation, the set of primitive files must be determined, and each file in the set must be
checked to ensure that it has not been changed since initial creation. Unix provides some
assistance here in the form of a time stamp for each file that gives the time that the file was last
modified. As long as the last modified time on a file is older than the creation time of the system,
then it would seem that it is a correct file. The problem occurs when the Unix move command,
mv, is used. This command moves a file between two directories and preserves its last modified
time. So when mv is used, the primitive file may not be correct and re-creation cannot be
done.¹² Worse, there is a system call, utimes, that can be used to change arbitrarily the last
modified time.¹³
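The time stamp problem can be reproduced in a few lines. The sketch below uses Python's `os.replace` (a rename, as mv performs within one volume) and `os.utime` (Python's interface for setting file times) to move an older, different file over a primitive file. The file names, the fixed `SYSTEM_CREATED` time, and the use of a content digest as the ground truth are assumptions chosen for illustration.

```python
import hashlib
import os
import tempfile


def digest(path):
    """Content digest, used here as the ground truth for 'unchanged'."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


SYSTEM_CREATED = 1_000_000_000  # hypothetical time at which the system was built

with tempfile.TemporaryDirectory() as workdir:
    primitive = os.path.join(workdir, "q.c")
    stale = os.path.join(workdir, "q_old.c")

    # The primitive file as it was when the system was created.
    with open(primitive, "w") as f:
        f.write("int main(void) { return 0; }\n")
    os.utime(primitive, (SYSTEM_CREATED - 100, SYSTEM_CREATED - 100))
    recorded = digest(primitive)

    # A different file that carries an even older last modified time.
    with open(stale, "w") as f:
        f.write("int main(void) { return 1; }\n")
    os.utime(stale, (SYSTEM_CREATED - 5000, SYSTEM_CREATED - 5000))

    # Moving it into place keeps its old time stamp.
    os.replace(stale, primitive)

    looks_unchanged = os.path.getmtime(primitive) < SYSTEM_CREATED
    really_unchanged = digest(primitive) == recorded
```

The time stamp test reports the file as older than the system, yet its content differs from what was originally used, which is exactly the situation in which re-creation silently goes wrong.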


The  use  of  SCCS  protects  the  user  from  changing  old  versions  of  a  source  file.  This  is  a  step 
forward,  although  the  problem  is  still  present  at  a  deeper  level  because  the  s-file  itself  is  subject 
to  all  the  previous  problems. 

In  a  persistent  object  base,  re-creation  would  be  achieved  by  immutable  objects  and  unique 
object identifiers, such as those provided by the Cedar System Modeller [Lampson 83]. All primitive
files and the full derivation graph would be stored as immutable objects whose content cannot
be changed by any user. Each object is assigned a unique identifier at creation in a way such
that  no  two  objects  ever  will  have  the  same  unique  identifier.  The  derivation  graph  would  refer  to 
primitive  objects  by  their  unique  identifier,  not  by  their  file  name;  so  move  and  copy  operations 
would  not  confuse  the  identification  of  the  primitive  objects. 
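A minimal sketch of this arrangement follows: content is stored once under a unique identifier and never updated, while names are a separate, mutable binding layer. The `ImmutableStore` class is hypothetical, and random UUIDs are only one assumed way of generating identifiers; this is not a description of the Cedar System Modeller.

```python
import uuid


class ImmutableStore:
    """Objects are created once and never modified; names are just bindings."""

    def __init__(self):
        self._content = {}  # unique identifier -> bytes, written exactly once
        self._names = {}    # path name -> unique identifier (may be rebound)

    def create(self, content: bytes) -> str:
        oid = uuid.uuid4().hex  # assumed identifier scheme
        self._content[oid] = bytes(content)
        return oid

    def read(self, oid: str) -> bytes:
        return self._content[oid]

    def bind(self, name: str, oid: str) -> None:
        if oid not in self._content:
            raise KeyError(oid)
        self._names[name] = oid

    def rename(self, old: str, new: str) -> None:
        self._names[new] = self._names.pop(old)

    def lookup(self, name: str) -> str:
        return self._names[name]


store = ImmutableStore()
v1 = store.create(b"int main(void) { return 0; }\n")
store.bind("/projects/Q_Development/q.c", v1)

# A derivation step records the identifier, not the path name.
step_inputs = (v1,)

# Later the file is moved, and an edit creates a new object under the new name.
store.rename("/projects/Q_Development/q.c", "/projects/Q_Development/program/q.c")
v2 = store.create(b"int main(void) { return 1; }\n")
store.bind("/projects/Q_Development/program/q.c", v2)

original_input = store.read(step_inputs[0])
```

Because the store offers no operation that overwrites existing content, the recorded derivation step still reaches exactly the bytes it originally used, regardless of moves or later edits.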

Keeping the information needed to re-create all the old versions of all systems on rotating
magnetic media is generally considered too expensive. Write-once laser disks are just starting to
become available [Fujitani 84]. They offer the ability to hold extensive historical information at
acceptable costs. When write-once laser disks are combined with the use of compressed
representations, there is no reason why all past versions of all source files cannot be kept
available on-line [Katz 84].

¹⁰Although not discussed here, Make also uses heuristics to avoid rerunning those derivation steps whose inputs have
not changed since they were run last.

¹¹Tools where the set of input and output files for an invocation can be determined easily present no problem. The C
compiler, which can read arbitrary include files, involves moderate difficulty.

¹²In practice, it often looks as though re-creation happens correctly. Users frequently spend many confusing hours
when the re-created system is subtly different from the original system.

¹³Tool designers have been known to use this call constructively to fake Make into doing "the right thing".

3.4  Attributes 

Unix  provides  a  fixed  set  of  attributes  for  each  file  as  part  of  its  directory  entry.  These  attributes 
include  the  name  of  the  owner  of  the  file,  a  set  of  file  protection  control  bits,  and  times  of  file 
creation,  modification  and  use.  These  attributes  are  mostly  set  and  used  by  the  system,  although 
there  are  commands  and  system  calls  that  permit  the  user  to  set  and  use  them. 

When additional attributes beyond those provided by Unix are needed, the user must find
alternative ways of representing them, since there is no way to add new attributes to a directory. The
set of additional attributes that could be of use in an environment is unlimited, being determined
by the needs of a development effort and the tools it uses. A persistent object base must be able
to support arbitrary attributes. Some examples of additional attributes include the unique
identifier for the file, a string attribute that gives the reason why the file was created, and a boolean
attribute that indicates whether the content of the file has been compressed.

Attributes  can  be  used  also  to  relate  files.  For  example,  the  version  graph  can  be  represented  by 
each  file  having  as  an  attribute  a  set  of  the  unique  identifiers  of  the  files  that  are  its  immediate 
version  predecessors.  Since  the  version  graph  is  symmetrical,  another  attribute  that  is  a  set  of 
version  successors  could  be  added  also,  but  these  two  attributes  contain  redundant  information. 
To  avoid  the  redundancy  and  to  preserve  the  symmetry,  a  better  way  of  representing  the  version 
graph  would  be  with  a  version  relationship  that  relates  predecessors  with  successors  but  that  is 
not  an  attribute  of  either.  So  in  addition  to  attributes,  a  persistent  object  base  should  support 
arbitrary  relationships.  Other  examples  of  relationships  include  the  derivation  graph,  and  a 
relationship  between  C  program  files  and  the  include  files  they  reference. 

In  addition  to  files  having  attributes,  a  persistent  object  base  should  also  permit  relationships  to 
have attributes.¹⁴ For example, the version relationship could have an attribute that says why a
successor version was produced from some predecessor version, and a derivation step within the
derivation relationship might have an attribute for the time at which it was run.
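The distinction between attributes, relationships, and attributes on relationships can be sketched as follows. Each link is stored once, so predecessors and successors are both derivable without redundant sets on either file. The `Relationship` class, the object identifiers, and the attribute names are hypothetical.

```python
class Relationship:
    """A named set of (source, target) links, each with its own attributes."""

    def __init__(self, name):
        self.name = name
        self._links = {}  # (source id, target id) -> attribute dict

    def relate(self, source, target, **attrs):
        self._links[(source, target)] = dict(attrs)

    def targets(self, source):
        return sorted(t for (s, t) in self._links if s == source)

    def sources(self, target):
        return sorted(s for (s, t) in self._links if t == target)

    def attrs(self, source, target):
        return self._links[(source, target)]


# Arbitrary per-object attributes, keyed by unique identifier.
attributes = {
    "obj-1": {"reason": "initial design", "compressed": False},
    "obj-2": {"reason": "bug fix", "compressed": True},
    "obj-3": {"reason": "next release", "compressed": False},
}

# The version relationship belongs to neither endpoint.
version = Relationship("version")
version.relate("obj-1", "obj-2", why="fix overflow in parser")
version.relate("obj-1", "obj-3", why="add new command")

successors_of_1 = version.targets("obj-1")
predecessors_of_2 = version.sources("obj-2")
```

A derivation relationship could be modelled the same way, with a link attribute recording the time at which each step was run.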

Typically, when tools are built on top of Unix that depend upon attributes and/or relationships,
then ad hoc encoding means are used. These means range from special purpose file encodings
such as the s-files of SCCS, through special purpose database systems such as that used in
Gandalf [Gandalf 85], to general purpose database systems such as that used in DSEE. Not only
is considerable effort wasted in building tools which each must do their own attribute and
relationship support, but even greater problems occur when tools that use different ad hoc schemes must
be integrated. Consider, for example, two systems, each of which uses its own special ad hoc
scheme and where the information used partially overlaps so that redundant information must be
synchronized between the two different schemes. The net result is that integration is very
difficult, if not impossible.

¹⁴It even makes sense to have relationships between relationships.

3.5  Synchronization 

When  two  or  more  users  are  working  on  the  same  system  and  therefore  the  same  set  of  files, 
some  means  of  synchronizing  that  use  is  needed.  When  two  people  are  editing  the  same  source 
file without synchronization, the changes made by one may overwrite the changes made by the
other without either being aware of the problem. In the absence of automated support, users
frequently do such synchronization by manual conventions. For example, a specific set of files
is agreed to be "controlled" by some specified user who may change any of the files while other
users  may  read  but  not  modify  the  files.  The  weaknesses  of  this  approach  are  that  it  is  time 
consuming,  error-prone,  and  often  overly  restrictive  in  limiting  modifications.  Under  Unix,  the 
SCCS  tool  provides  support  for  synchronization  at  the  level  of  each  source  file.  SCCS  has  two 
basic operations for synchronization: get a file for editing from the s-file; and merge the edited
file  back  into  the  s-file.  Only  one  user  may  have  a  given  s-file  in  the  editing  state,  between  a  get 
and  a  merge.  This  is  overly  restrictive  because  multiple  edits  could  be  proceeding  safely  on 
independently  forked  alternatives.  The  RCS  tool  solves  this  problem  by  permitting  one  edit  to  be 
occurring  on  each  alternative  fork.  Both  SCCS  and  RCS  require  explicit  extra  action  by  the  user 
to get and merge a file when editing it. This is often enough of a burden to discourage users
from  using  either  SCCS  or  RCS.  The  DSEE  system  provides  the  synchronization  in  an  integrated 
fashion  that  is  less  of  a  burden  for  the  user  and  that  is  harder  to  subvert. 
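
The difference between the two locking policies can be sketched in Python. The `LockTable` class and its `per_fork` flag are hypothetical names used for illustration; they do not reproduce the actual SCCS or RCS commands or file formats.

```python
class LockTable:
    """Tracks which user holds the edit lock for a file or for one fork of it."""

    def __init__(self, per_fork: bool):
        # per_fork=False: one editor per file (the restrictive policy).
        # per_fork=True: one editor per alternative fork of a file.
        self.per_fork = per_fork
        self.holders = {}

    def _key(self, path: str, fork: str):
        return (path, fork) if self.per_fork else (path, None)

    def get(self, user: str, path: str, fork: str = "main") -> bool:
        """Try to start an edit; return False if someone else is already editing."""
        key = self._key(path, fork)
        if key in self.holders:
            return False
        self.holders[key] = user
        return True

    def merge(self, user: str, path: str, fork: str = "main") -> None:
        """Finish an edit and release the lock; only the holder may do so."""
        key = self._key(path, fork)
        if self.holders.get(key) != user:
            raise PermissionError(f"{user} does not hold the lock on {path}")
        del self.holders[key]


per_file = LockTable(per_fork=False)
per_fork = LockTable(per_fork=True)
```

With `per_file`, a second user is refused even when working on a different fork; with `per_fork`, edits on independent forks proceed and only edits on the same fork are serialized.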

Another problem with SCCS and RCS is that the default mode of operation for get is to extract
the  most  recent  version  on  the  main  version  line.  At  first  this  seems  like  a  desirable  feature,  since 
most  of  the  work  on  a  system  is  with  the  most  recent  version.  Problems  can  occur,  however, 
when  multiple  people  are  producing  new  versions  of  the  primitive  files  of  a  system.  When  a  user 
changes  some  primitive  files  and  then  does  a  build  based  on  most  recent  versions  of  all  primitive 
files,  the  resulting  system  will  incorporate  not  only  the  user's  changes  but  also  possibly  arbitrary 
other  changes  made  by  other  users  to  other  primitive  files.  The  net  result  is  that  the  behavior  of  a 
most  recently  built  system  will  often  change  over  time  in  subtle  ways  that  are  not  under  the  user's 
control.  Normally,  under  Unix  no  derivation  graph  is  recorded;  thus  it  is  difficult,  if  not  impossible, 
to  figure  out  which  set  of  primitive  files  have  changed  since  the  previous  system  build.  Time 
stamps  are  one  clue  to  what  has  changed,  but  due  to  problems  discussed  earlier,  they  are  not 
always  reliable.  Recall  that  primitive  files  include  not  only  source  files  that  belong  to  the  system 
under development but also libraries and tools that are part of Unix. Normally, Unix libraries and
tools are not version controlled; nevertheless, they are changed during periodic operating system
releases and by Unix system software maintainers at arbitrary times to fix perceived "bugs".
These  changes  cause  not  only  unpredictable  behavioral  changes  in  the  current  system  builds,  but 
can  also  destroy  the  ability  to  re-create  previous  system  versions. 


A  high  quality  programming  environment  must  support  synchronization  that  is  simple  for  users, 
place  no  unnecessary  restrictions  on  simultaneous  access,  support  a  version  control  system  for 
both  user  and  system  files,  record  the  full  derivation  graph,  allow  the  user  to  control  explicitly 
which  versions  of  primitive  files  to  use,  and  support  inquiry  so  that  users  can  determine  easily 
which  primitive  files  of  a  system  have  been  changed.  A  persistent  object  base  should  provide  the 
basic  support  layer  on  which  such  environments  can  be  built. 
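
As a hedged illustration of the last two requirements, the Python sketch below records, for each build, exactly which version of every primitive file was used, and answers the inquiry "what changed since the previous build?". The `BuildRecord` type, the helper functions, and the file names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRecord:
    """A recorded derivation: the explicit version of each primitive file used."""
    name: str
    inputs: tuple  # sorted tuple of (path, version) pairs


def record_build(name: str, pinned: dict) -> BuildRecord:
    # The user supplies explicit versions; nothing defaults to "most recent".
    return BuildRecord(name, tuple(sorted(pinned.items())))


def changed_inputs(previous: BuildRecord, current: BuildRecord) -> dict:
    """Map each path whose version differs to its (old, new) versions.

    A path present in only one build is reported with None on the other side.
    """
    old, new = dict(previous.inputs), dict(current.inputs)
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(old.keys() | new.keys())
        if old.get(path) != new.get(path)
    }


build1 = record_build("nightly-1", {"main.c": 4, "util.c": 2, "libc.a": 1})
build2 = record_build("nightly-2", {"main.c": 5, "util.c": 2, "libc.a": 2})
delta = changed_inputs(build1, build2)
```

In this example `delta` reports both the user's own change to `main.c` and the change to the system library `libc.a`, which is the kind of change that otherwise goes unnoticed.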

4  Weaknesses  of  Database  Systems 

This  section  considers  five  areas  where  database  systems  are  inadequate  for  persistent  object 
bases:  types,  decentralization,  time,  distribution,  and  performance.  Relational  database  systems 
[Codd  70]  will  be  used  as  examples.  Other  database  systems  will  differ  in  their  details  from 
relational  systems  but  display  essentially  similar  weaknesses. 

Engineering  databases,  particularly  those  used  for  CAD/CAM,  share  many  basic  requirements 
with  programming  environments.  Many  of  the  weaknesses  that  have  been  identified  in  these 
applications [Hallmark 84, Hartzband 85] are similar to those discussed here.

Relational  database  systems  are  now  just  starting  to  be  used  within  programming  environments 
for  applications  including  source  program  tree  representation,  dynamic  execution  behavior,  and 
version and configuration control [Ceri 83, Snodgrass 84, Linton 84].

4.1  Types 

Types  in  relational  database  systems  are  considered  here  from  three  perspectives:  primitive 
types,  structural  types,  and  abstract  types. 

Relational  database  systems  typically  have  a  small  predefined  set  of  primitive  types  for 
attributes.15  This  set  is  often  quite  constraining  when  used  for  programming  environments.  For 
example,  consider  using  a  relation  to  represent  the  version  graph. 

Version : relation
    old: integer,    -- old version number
    new: integer,    -- new version number
    why: string      -- reason for change
end

A  value  for  this  relation  might  be  as  follows. 


15In database systems, the term domain is used to refer to a set of values for some attribute. The term type is used
here instead of domain to emphasize analogies with the type mechanisms of programming languages.


Here  the  why  field  is  a  string  of  arbitrary  length.  The  only  string  type  provided  by  many  relational 
systems  is  a  fixed  length  string.  The  effect  of  varying  length  strings  can  be  achieved,  but  only  by 
subverting  the  system. 

Version1 : relation
    old: integer,       -- old version number
    new: integer,       -- new version number
    count: integer,     -- string index
    why: string(20)     -- reason for change
end


Not  only  is  the  structure  of  the  data  obscured,  but  both  space  to  store  the  data  and  time  to  access 
it  are  degraded.  This  kind  of  subversion  only  gets  worse  when  trying  to  represent  more  complex 
programming  environment  data  such  as  program  source  and  relocatable,  documentation,  and 
graphics.  Although  all  of  these  could  be  built  up  from  the  primitive  types  of  a  relational  database 
system, the effort is large and the representation would be neither natural nor efficient.
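
The subversion described above can be made concrete with a short Python sketch. It assumes the layout of the Version1 relation: each reason is cut into 20-character pieces, one tuple per piece, ordered by `count`. The function names `encode` and `decode` are hypothetical.

```python
CHUNK = 20  # width of the fixed-length string field, as in string(20)


def encode(old: int, new: int, why: str) -> list:
    """Split one logical tuple into fixed-width tuples (old, new, count, piece)."""
    pieces = [why[i:i + CHUNK] for i in range(0, len(why), CHUNK)] or [""]
    # Pad each piece to the fixed width, as a fixed-length field would require.
    return [(old, new, count, piece.ljust(CHUNK))
            for count, piece in enumerate(pieces)]


def decode(rows: list, old: int, new: int) -> str:
    """Reassemble the reason for one (old, new) pair from its pieces."""
    mine = sorted(r for r in rows if r[0] == old and r[1] == new)
    # Trailing padding is stripped, so trailing blanks in the original are lost.
    return "".join(piece for _, _, _, piece in mine).rstrip()


rows = encode(1, 2, "Fixed an off-by-one error in the tokenizer")
```

One logical value now occupies several tuples, and every read pays for a selection, a sort, and a concatenation, which is the space and time degradation the text describes.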


Structurally, a relational database consists of a set of named relations. Consider, for example, a
database with two relations.
Version : relation
    old: integer,    -- old version number
    new: integer,    -- new version number
    why: string      -- reason for change
end

Source : relation
    version: integer,    -- version number
    day: integer,        -- day created
    month: integer,      -- month created
    year: integer        -- year created
end

All of the relationships between relations are expressed implicitly. Typically, two relations are
related by using the same type for some attribute in each so that they can be joined. For example,
Version.old could be joined to Source.version. It is not the case, however, that whenever
two relations have attributes with the same type it makes sense to join them. For
example, joining Version.old to Source.month is not a sensible operation. One way to
introduce more structure is by stronger typing such as that provided by the Modula-2 type
declaration [Wirth 85].

type version_type = integer;
type day_type = integer;
type month_type = integer;
type year_type = integer;

Version : relation
    old: version_type,    -- old version number
    new: version_type,    -- new version number
    why: string           -- reason for change
end

Source : relation
    version: version_type,    -- version number
    day: day_type,            -- day created
    month: month_type,        -- month created
    year: year_type           -- year created
end
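
A rough Python analogue of this stronger typing is sketched below. Each attribute is declared with a named type, and a hypothetical `join` helper refuses to join attributes whose named types differ, even though both are integers underneath. The schema dictionaries and `join` are assumptions for illustration, not a real database interface.

```python
# Each schema maps attribute name -> named type; all but "string" are integers underneath.
VERSION_SCHEMA = {"old": "version_type", "new": "version_type", "why": "string"}
SOURCE_SCHEMA = {"version": "version_type", "day": "day_type",
                 "month": "month_type", "year": "year_type"}


def join(left, left_schema, left_attr, right, right_schema, right_attr):
    """Equi-join two lists of dict tuples, but only on attributes of one named type."""
    if left_schema[left_attr] != right_schema[right_attr]:
        raise TypeError(
            f"cannot join {left_schema[left_attr]} with {right_schema[right_attr]}")
    return [{**lt, **rt} for lt in left for rt in right
            if lt[left_attr] == rt[right_attr]]


version = [{"old": 1, "new": 2, "why": "bug fix"}]
source = [{"version": 1, "day": 3, "month": 1, "year": 1986},
          {"version": 2, "day": 9, "month": 2, "year": 1986}]

joined = join(version, VERSION_SCHEMA, "old", source, SOURCE_SCHEMA, "version")
```

Without the type check, joining `old` to `month` would silently match version 1 with January; with it, the attempt raises `TypeError`.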

Another  structural  approach  is  to  use  graphical  entity-relationship  diagrams  [Chen  76]. 


Source : relation
    version: integer,    -- version number
    day: integer,        -- day created
    month: integer,      -- month created
    year: integer        -- year created
end

Version : relation
    old: key[Source],    -- old version
    new: key[Source],    -- new version
    why: string          -- reason for change
end

The key attribute is automatically supplied and initialized by the system. Since joins now are
based on unique surrogate keys, structural relationships are specified fully. Surrogate keys are
closely related to the unique object identifiers discussed earlier and to the typed pointers of
Modula-2.

If  surrogate  keys  are  placed  not  only  on  tuples  but  on  entire  relations,  then  relations  can  be  used 
to relate other relations. For example, a directory tree like that of a file system can be
represented.  First,  a  relation  type  for  directories  is  introduced. 

type Directory = relation
    name: string,
    file: key
end

As  an  example,  the  following  directory  tree  is  used. 

/Q_Development/V1/documentation
/Q_Development/V1/program
/Q_Development/V2/documentation
/Q_Development/V2/program

That tree is represented by the following instances of the Directory type.
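
One way to picture such instances is the Python sketch below, offered only as an illustration: every Directory instance is a relation of (name, file) tuples, and the file field holds the surrogate key of another instance. The integer key values, the `store` dictionary, and the `resolve` helper are assumptions, not the paper's notation.

```python
# store maps a surrogate key to a Directory instance: a dict of name -> key.
# A leaf is represented here by an empty instance; the key values are arbitrary.
store = {
    0: {"Q_Development": 1},                  # root
    1: {"V1": 2, "V2": 3},                    # /Q_Development
    2: {"documentation": 4, "program": 5},    # /Q_Development/V1
    3: {"documentation": 6, "program": 7},    # /Q_Development/V2
    4: {}, 5: {}, 6: {}, 7: {},
}


def resolve(path: str, root: int = 0) -> int:
    """Follow a slash-separated path and return the surrogate key it names."""
    key = root
    for name in path.strip("/").split("/"):
        key = store[key][name]  # raises KeyError if the name is absent
    return key
```

The names `documentation` and `program` appear twice without conflict because each occurrence belongs to a different Directory instance, identified by its own key.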
The lack of abstract data types [Shaw 84] in relational database systems is perhaps the most
significant type weakness. All data in a relational database exists at a structural level. There is
no way to define a new abstract type in terms of its abstract properties and then define its
implementation in terms of existing types. Reconsidering an earlier example, varying length
strings could be defined as a new abstract type that used a variable number of fixed length
strings as its representation. This kind of abstraction becomes even more important for complex
objects such as those that represent graphic images.

Abstract data types gain much of their power from considering not just data in isolation, but data
together with the set of operations. In relational database systems, the data specification written
in some schema language is separated from the operations as expressed in some query language.
Not only are the specifications physically separate, but often they are expressed in an
incompatible language.

Another aspect of abstract data types is that the implementation can be changed without impacting
the users of the specification. Database systems normally provide users with a limited level of
control over the way in which the data is represented. For example, many database systems
allow users to specify those places where redundant inverted indexes are to be created. When
the user needs a representation that is not supported by the system, the only alternative is to
modify the source code of the database system itself. Even in those rare cases where source
code is available, the complexity of most database systems makes this a formidable task. A
solution is to put more of the control for representations in the hands of the user via an abstract
type mechanism.

A persistent object base system should provide a rich set of primitive types, enable expression of
rich structural relationships such as those of the extended relational models, and provide a full
abstract data type mechanism. There are obvious parallels between the needed future direction
for database systems and the past evolution of type support within modern programming languages.
Many of the same type features found in modern languages need to be brought into
database systems; however, database systems face special problems brought about by the
persistence of data that were not faced by the designers of programming languages.


4.2 Decentralization

Database systems typically have a single centralized schema that is maintained by a database
administrator (DBA). For programming environments, it must be possible to define and control
data locally. This need is demonstrated below by several examples.

As an initial example, consider the set of documentation files in a system including help files, user
manuals, implementation descriptions, and even the source files of systems. These represent
online versions of information that each user would previously have had in hardcopy. One advantage
of hardcopy is that it is easy for each user to write in personal comments. The same
approach could be used online by letting each user make a copy of the document and edit in
personal comments, but it would be better to have a single copy of the document and let each
user have a separate "overlay" that contains personal comments. This kind of ability is
becoming available through a class of environments called hypertext systems [Yankelovich 85].
Consider the following simplified relation types.

type Document = relation
    line_number: integer,
    line: string
end;

type Comments = relation
    document: key[Document],
    line_number: integer,
    comment: string
end;


Considerable progress toward decentralization is already implicit in the use of type definitions and
surrogate keys. Type definitions permit multiple instances. Surrogate keys enable an object and
its attributes to be stored separately. This separation is important not only for local control but
also because it permits data, such as an instance of Comments, to be added to a preexisting data
structure, such as an instance of Document, without modifying either the type definition or contents
of that preexisting data structure. Not only must the database permit the right kind of
definitions, it also must permit the needed operations. As basic operations, each user must be
able to create locally an instance of Comments and to control the use of that instance. As a more
general operation, users should be able to define their own relation types for their own local use.
In many database systems, these abilities are centralized with the DBA. Making a user go
through a DBA for these kinds of operations is not only bothersome but also logically
unnecessary.16
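
The overlay idea can be sketched in Python as follows. The shared document is stored once; each user creates a local Comments instance that refers to it by a surrogate key, and a hypothetical `annotated_view` function merges the two for display without touching the document. All names here are illustrative assumptions.

```python
import itertools

_next_key = itertools.count(1)  # stand-in for system-supplied surrogate keys

documents = {}  # surrogate key -> tuple of lines (the single shared copy)


def create_document(lines):
    key = next(_next_key)
    documents[key] = tuple(lines)  # a tuple, so the shared copy is not edited in place
    return key


def create_comments(document_key):
    """Create a local Comments instance: it only references the document."""
    return {"document": document_key, "by_line": {}}


def annotated_view(comments):
    """Yield (line_number, line, comment-or-None) for one user's overlay."""
    lines = documents[comments["document"]]
    for number, line in enumerate(lines, start=1):
        yield number, line, comments["by_line"].get(number)


manual = create_document(["Install the tool.", "Run the tool."])
ann = create_comments(manual)
bob = create_comments(manual)
ann["by_line"][2] = "Remember the -v flag."
```

Each user creates and controls an overlay locally; neither the Document type definition nor the stored document changes when a comment is added.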
As a second example, consider what happens when a new tool is added to a programming
environment. This tool may need the ability to create and access new attributes and relationships
of existing objects. The needs for local definition, instantiation, and control are similar to those of
the previous example. For tools, decentralization also can be an aid to integration. Since each
tool can manage locally the attributes and relationships for that tool, independent tools will not
place conflicting constraints on centralized data. Conflicts can be representational, such as multiple
tools wanting to use word 23 of some control block, or naming, such as multiple tools wanting
to use the attribute name Next. Particularly severe conflicts can arise when two versions of a
single tool are being supported simultaneously. For example, both versions might have a Next
attribute, but give it slightly different semantics. By giving each version its own instance of the
Next attribute, multiple versions can coexist without interference.17
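
A small Python sketch shows how local attribute instances avoid the naming conflict. Attributes are stored outside the objects, keyed by the owning tool and tool version as well as by attribute name; the `AttributeStore` class and the tool names are hypothetical.

```python
class AttributeStore:
    """Attributes kept apart from objects, one namespace per (tool, version)."""

    def __init__(self):
        self._values = {}

    def set(self, tool, version, obj_id, name, value):
        self._values[(tool, version, obj_id, name)] = value

    def get(self, tool, version, obj_id, name):
        return self._values[(tool, version, obj_id, name)]


attrs = AttributeStore()
# Two versions of one tool both use an attribute called Next on object 17,
# with slightly different meanings; neither overwrites the other.
attrs.set("linker", 1, 17, "Next", 18)       # next object in file order
attrs.set("linker", 2, 17, "Next", 42)       # next object in load order
attrs.set("formatter", 1, 17, "Next", None)  # an unrelated tool, same name
```

Because the tool and its version are part of the key, adding a third tool or a third version requires no change to the object or to any other tool's data.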

As  a  final  example,  consider  integrating  two  previously  independent  databases.  Such  integration 
could  occur  when  two  isolated  programming  environment  systems  are  connected  via  a  network 
and a transparent network file system is installed. When centralized schemas are used, integration
will require merging these two schemas into a single new schema. Conflicts are virtually
certain to occur, forcing either massive recoding or a less transparent integration in which the two
independent schemas continue to exist.

For  a  persistent  object  base,  decentralization  of  definition,  instantiation,  and  control  is  essential. 
This  implies  that  there  will  not  be  a  database  administrator  doing  all  data  definition.  Another 
implication  is  that  traditional  kinds  of  normalization  that  are  based  on  a  single  centralized  schema 
cannot  be  done.  Since  normalization  is  a  method  of  removing  redundancy  and  since  controlled 
redundancy  can  be  used  to  improve  the  engineering  of  software  systems,  full  normalization  may 
not  only  be  limited  but  also  undesirable. 

4.3 Time

The  basic  relational  data  model  views  the  database  as  having  values  that  vary  over  time.  Every 
attribute  is  considered  to  be  variable  and  only  its  current  value  is  available.  As  was  previously 
discussed, programming environments must provide a history mechanism to record source versions
and to support re-creation.

In many simple database systems that are used to support programming environment tools, the
only way to preserve history is by making a complete copy of the entire database. In more
powerful database systems, transaction journals are used to preserve history. A similar capability
is provided in file systems by periodic backup of all the files on a system. All of these approaches
have a common weakness: to go back in time, it is necessary to manually substitute a previous
version of the data for the current version. This substitution can be either physical or logical. In a
physical substitution, the current version is copied to a safe place and then the old version is
brought back in the place of the new version. For logical substitution, the new version is left in
place, but operations are logically changed to operate on the reconstituted old version.

17Local attribute instances of course do not solve all integration problems. Mechanisms, such as those in [Garten 86],

For  a  persistent  object  base,  the  ability  to  record  previous  states  and  to  change  back  easily  and 
transparently  to  an  arbitrary  previous  state  is  essential.  To  capture  this  ability  cleanly  and  safely 
requires  more  than  just  the  addition  of  attributes  that  can  take  on  time  values.  A  persistent  object 
base  must  include  time  as  an  integral  part  of  its  underlying  formal  semantic  model  [Clifford  83]. 

Work  on  temporal  database  systems  [Snodgrass  86]  shows  that  time  can  be  modeled  with  two 
dimensions  whose  axes  are  transaction  time  and  real  time.  The  transaction  time  axis  measures 
the  actual  state  of  a  system  over  time.  By  backing  up  along  the  transaction  time  axis  to  some 
previous  time,  the  system  state  is  logically  returned  to  what  it  was  at  that  previous  time.  By 
backing  up  along  the  real  time  axis,  the  system  state  is  logically  returned  to  the  state  of  reality  at 
that  previous  time  (as  determined  by  our  best  current  knowledge).  Transaction  and  real  times  will 
differ  when  either  there  is  delay  between  the  time  at  which  an  event  occurs  and  the  time  at  which 
it is first entered into the database or when some event is incorrectly entered and later corrected.
Programming environments are unlike conventional database applications because the reality
that is being modeled is data within the database itself. This implies that for programming
environments transaction time is identical to real time. Suppose, however, that the concept of exact
modeling of reality is replaced by the concept of exactly correct program behavior. Now forward
progress of some program under development along the transaction axis represents increased
(or modified) functionality, while forward progress along the correctness axis represents an
increase in the number of bugs fixed.
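
To make the transaction time axis concrete, the following Python sketch keeps an append-only history for one attribute and answers "as of" queries. It models only one of the two time dimensions and uses integer timestamps supplied by the caller; the `History` class is an illustrative assumption, not the formal model of [Clifford 83] or [Snodgrass 86].

```python
import bisect


class History:
    """Append-only record of one attribute's values along transaction time."""

    def __init__(self):
        self._times = []    # strictly increasing transaction times
        self._values = []   # value that became current at the matching time

    def record(self, time: int, value) -> None:
        if self._times and time <= self._times[-1]:
            raise ValueError("transaction time must move forward")
        self._times.append(time)
        self._values.append(value)

    def as_of(self, time: int):
        """Return the value that was current at the given transaction time."""
        index = bisect.bisect_right(self._times, time) - 1
        if index < 0:
            raise LookupError("no value recorded at or before that time")
        return self._values[index]


source = History()
source.record(10, "version 1 text")
source.record(25, "version 2 text")
source.record(40, "version 3 text")
```

Nothing is ever overwritten, so returning to an earlier state is a query such as `source.as_of(30)` rather than a manual substitution of old data for new.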

4.4  Distribution 

Most  currently  available  database  systems  require  that  all  data  be  kept  within  a  single  machine. 
Future  programming  environments  will  be  based  on  multiple  machines  connected  via  many  kinds 
of  networks.  Although  considerable  work  is  now  being  done  on  distributed  database  systems, 
current  systems  are  still  rather  limited  [Ceri  84].  Two  aspects  of  distribution  are  considered:  how 
data  is  distributed  among  multiple  machines  and  how  multiple  users  on  different  machines  can 
share  the  same  data. 

The goal of data distribution is to place data physically so that it is available easily and quickly to
its users while satisfying the hardware size constraints. The simplest way to distribute a relational
database is to place different relation instances on different machines. The placement can be
static, determined when the instance is created, or dynamic, changeable at any time.
Independently, the placement can be manual, under the control of the user, or automatic, under the
control of the system.

A  more  complex  distribution  would  place  different  parts  of  the  same  relation  on  different 
machines.  For  example,  consider  the  version  graph  relation.  Accesses  to  that  relation  will  tend 
to be to tuples for recent versions. Older tuples can be placed on remote, slower, and/or larger
physical devices of the system.18


When  two  people  are  using  the  same  data,  then  in  general  no  one  place  is  best  for  both.  A 
solution  is  to  permit  separate  copies  of  the  data  to  exist  at  locations  that  are  good  for  each  user. 
When  the  users  are  reading  the  data  and  neither  is  modifying  it,  then  permitting  multiple  copies  is 
easier.  A  special  case  of  read-only  data  is  immutable  objects.  Many  network  file  systems  are 
now  providing  caching,  a  dynamic  automatic  mechanism  for  transparently  creating  and  managing 
multiple  copies  of  data  [Schroeder  85,  Morris  86]. 

Not  all  data  in  a  programming  environment  can  be  immutable.  At  least  some  data  must  be 
mutable  for  progress  to  be  made.  A  simple  mechanism  for  dealing  with  mutable  data  in  the 
presence  of  multiple  users  is  to  use  a  central  server.  A  server  is  a  specific  machine  that  controls 
write  access  to  data.19  The  server  ensures  that  only  one  user  is  writing  the  same  data  at  the 
same  time.  Before  a  user  can  modify  data,  a  lock  is  set  on  the  server  so  that  other  users  cannot 
modify  that  same  data. 

Servers limit effective distribution. The problem can be reduced by either minimizing the
frequency with which a user interacts with the server or by modifying data in ways that do not
require the use of a central server.

To  understand  how  to  minimize  server  interaction,  it  is  instructive  to  consider  how  multiple  users 
working  on  the  same  system  interact  when  using  a  programming  environment  that  provides  no 
synchronization  for  data  modification.  In  this  case,  the  users  often  invent  manual  methods  for 
synchronization.  Other  than  failures  that  occur  when  someone  forgets  the  state  of  the  manually 
set  locks,  such  methods  work  just  fine.  An  important  distinguishing  characteristic  of  these  manual 
methods  is  the  frequency  of  the  synchronization  operations.  While  common  automated  systems 
often  operate  with  a  frequency  of  many  synchronization  operations  per  second  [Ousterhout  85], 
manual  methods  may  have  a  frequency  of  only  a  few  operations  per  day.  By  implementing 
analogues  of  these  manual  methods,  server  interaction  rates  can  be  lowered.  As  an  example, 
consider  the  directory  tree  of  a  network  file  system.  Every  time  a  new  file  name  is  created,  a 
synchronization  operation  is  needed.  Most  users  on  Unix  systems  are  creating  and  destroying 
files  at  a  high  rate.  To  lower  the  rate  involves  completely  rethinking  the  role  of  global  name 
spaces  in  programming  environments.20 

As  an  example  of  how  data  can  be  modified  without  involving  a  server,  consider  the  version 
graph  relation. 


18This includes migrating old tuples to magnetic tape.

19In practice, there can be multiple servers as long as each data item is handled by exactly one server.

20Names in Unix serve two independent purposes: connecting uses to their definitions and communicating information
between users. Uses can be connected to definitions by using unique identifiers instead of names. Each unique identifier
may still have a name, but that name is used for local display purposes only; the connection is made via the unique
identifier. A global name then is needed only in those relatively less frequent cases where information is passed between
users. A single global name may be communicated for an entire system that internally contains thousands of local unique


Version : relation
    old: key[Source],    -- old version
    new: key[Source],    -- new version
    why: string          -- reason for change
end

Suppose that two users each want to create a successor of some existing version. Each will
create a new Source object and add a new tuple for it to the Version relation. There is no basic
conflict  between  these  additions.  Since  the  tuples  of  a  relation  are  an  unordered  set,  the  order  of 
the  additions  will  not  affect  the  final  value  of  the  relation. 
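
The order independence is easy to check in a Python sketch in which the relation is a set of tuples. The key values such as `"s7"` stand for surrogate keys of Source objects and are invented for the example.

```python
# The Version relation as an unordered set of (old, new, why) tuples.
base = frozenset({("s1", "s2", "initial port")})

# Two users independently add a successor of the same existing version s2.
tuple_a = ("s2", "s7", "user a: performance work")
tuple_b = ("s2", "s9", "user b: bug fix")

# Apply the additions in both possible orders.
a_then_b = (base | {tuple_a}) | {tuple_b}
b_then_a = (base | {tuple_b}) | {tuple_a}
```

Both orders yield the same three-tuple relation, so no server is needed to decide which addition came first.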

Since a network imposes finite delays21, time within a network is relativistic [Lamport 78]. In
relativistic time, there is no system-wide absolute clock. Each machine within the network is
assumed to have its own clock that progresses at its own rate. In such a system, there is no total
ordering of events. Consider again the two users, a and b, each of which has a machine within
the same network, both trying to create a successor of the same existing version. To a it may
seem as though the new a tuple appears before the new b tuple, while to b the order appears to
be the b tuple followed by the a tuple.22
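
The logical clocks of [Lamport 78] give one way to reason about such orderings. The Python sketch below is a minimal rendering of that scheme, with hypothetical class and method names: each machine keeps a counter, increments it on every local event, and on receipt of a message advances past the timestamp carried by the message.

```python
class LogicalClock:
    """One machine's Lamport-style logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """A local event, such as adding a tuple; returns its timestamp."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Receiving moves the clock past both its own time and the sender's."""
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LogicalClock(), LogicalClock()
a_added = a.tick()          # a adds its tuple
b_added = b.tick()          # b adds its tuple, concurrently
msg = a.send()              # a tells b about its tuple
b_learned = b.receive(msg)  # b learns of a's tuple after adding its own
```

Here the two additions carry the same timestamp and no message links them, so the scheme orders neither before the other; only the send and the matching receive are ordered.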

Now suppose that each new version is to be given the next new integer version number. This
can only be done by having a single server that assigns those numbers. Many version management
systems have a similar problem. When several alternative versions are present, one of
them is designated as the "primary" version. A central server is needed to control which new
version is to be the primary version.

For a persistent object base, automatic dynamic placement and caching of relations and parts of
relations are needed. Various methods must be used to avoid high interaction rates with central
servers.

4.5 Performance

The  performance  of  database  systems  is  tuned  to  access  patterns  that  may  be  quite  different 
from those expected in programming environments. Performance is considered here in terms of
what  is  accessed  and  who  is  accessing  it.  A  thorough  understanding  of  the  performance  issues 
of  using  databases  for  programming  environments  can  occur  only  after  many  more  experiments 
are  carried  out  and  much  more  analysis  is  done.  Based  on  what  is  still  very  limited  experience, 
this  discussion  speculates  on  areas  where  performance  problems  seem  most  likely  to  occur. 

Relational  database  systems  typically  are  tuned  to  emphasize  the  performance  of  operations  that 
deal  with  entire  relations,  such  as  join  and  projection.  In  a  programming  environment,  operations 
that deal only with a single tuple from each of many relations may be more frequent. In programming
environments, access patterns that traverse trees or directed graphs are common. Such
traversals must extract a single tuple, from the relation that represents the tree or graph, at each
step. Relational database systems typically assume that all tuples of a relation are equally likely
to be accessed. In the version relation, for example, tuples for more recent versions are more
likely to be accessed. Graph transitive closure operations are common in programming environments.
For example, a query might be to determine all versions that are direct or indirect
predecessors of some given version. Database systems are not normally tuned to make transitive
closure efficient. Furthermore, relationally complete query languages cannot in general
even express transitive closure.
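
For illustration, the predecessor query can be written as an ordinary graph traversal in Python. The loop repeats a lookup until no new versions appear, which is the step the paragraph above notes is missing from relationally complete query languages. The sample `version_tuples` data is invented.

```python
from collections import defaultdict


def all_predecessors(version_tuples, version):
    """Return every direct or indirect predecessor of the given version."""
    direct = defaultdict(set)
    for old, new in version_tuples:
        direct[new].add(old)

    found = set()
    frontier = [version]
    while frontier:                      # repeat until no new versions appear
        current = frontier.pop()
        for old in direct[current]:
            if old not in found:
                found.add(old)
                frontier.append(old)
    return found


# (old, new) pairs: 1 -> 2 -> 3, with 2 also forking to 4, and 3 and 4 merged into 5.
version_tuples = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
```

Each step of the loop extracts the tuples for a single version, which matches the single-tuple access pattern described above rather than a whole-relation join.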

Many database systems are designed mainly to interact with people. In a programming environment,
most of the use of the data will be by programs. Most database query languages are
interpreted, not compiled. While users may accept small delays due to interpretation, the heavy
use of data accessing programs may produce unacceptable performance degradation in programming
environments. Of special performance significance is the use of surrogate keys.
These keys serve exactly the same role as pointers do in most programs: they are used to build
linked structures such as trees and graphs. The efficiency of the pointer dereference operation
is known to be a major factor in determining the execution speed of most system programs.
Unless surrogate keys can be implemented with an efficiency approaching pointers,
relational databases may prove to be an unacceptable basis for programming environments.23

5  Conclusions 

This  paper  has  examined  from  several  perspectives  the  weaknesses  of  file  systems  and 
database  systems  as  a  basis  for  persistent  object  bases  of  programming  environments.  Neither 
current file systems nor current database systems are adequate to support a first class persistent
object base. In many areas, however, current research and development are progressing toward
systems  that  correct  at  least  some  of  the  weaknesses. 

This  paper  provides  designers  and  evaluators  of  persistent  object  bases  with  a  checklist  of  issues 
to  be  considered  and  a  list  of  problem  areas  where  further  work  is  needed.  However,  the  real 
work  of  building  a  persistent  object  base  may  be  less  concerned  with  finding  novel  solutions  to 
specific  problems  and  more  concerned  with  effectively  integrating  current  technology. 